knowledge graph builder


Install

npx skills add https://github.com/daffy0208/ai-dev-standards --skill 'Knowledge Graph Builder'
Knowledge Graph Builder
Build structured knowledge graphs for enhanced AI system performance through relational knowledge.
Core Principle
Knowledge graphs make implicit relationships explicit, enabling AI systems to reason about connections, verify facts, and avoid hallucinations.
When to Use Knowledge Graphs
Use Knowledge Graphs When:
✅ Complex entity relationships are central to your domain
✅ Need to verify AI-generated facts against structured knowledge
✅ Semantic search and relationship traversal required
✅ Data has rich interconnections (people, organizations, products)
✅ Need to answer "how are X and Y related?" queries
✅ Building recommendation systems based on relationships
✅ Fraud detection or pattern recognition across connected data
Don't Use Knowledge Graphs When:
❌ Simple tabular data (use relational DB)
❌ Purely document-based search (use RAG with vector DB)
❌ No significant relationships between entities
❌ Team lacks graph modeling expertise
❌ Read-heavy workload with no traversal (use traditional DB)
6-Phase Knowledge Graph Implementation
Phase 1: Ontology Design
Goal: Define entities, relationships, and properties for your domain
Entity Types (Nodes): Person, Organization, Location, Product, Concept, Event, Document
Relationship Types (Edges):
Hierarchical: IS_A, PART_OF, REPORTS_TO
Associative: WORKS_FOR, LOCATED_IN, AUTHORED_BY, RELATED_TO
Temporal: CREATED_ON, OCCURRED_BEFORE, OCCURRED_AFTER
Properties (Attributes):
Node properties: id, name, type, created_at, metadata
Edge properties: type, confidence, source, timestamp
Example Ontology:

# RDF/Turtle format
@prefix : <http://example.org/ontology#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

:Person a owl:Class ;
    rdfs:label "Person" .

:Organization a owl:Class ;
    rdfs:label "Organization" .

:worksFor a owl:ObjectProperty ;
    rdfs:domain :Person ;
    rdfs:range :Organization ;
    rdfs:label "works for" .

Validation:
Entities cover all domain concepts
Relationships capture key connections
Ontology reviewed with domain experts
Classification hierarchy defined (is-a relationships)
Phase 2: Graph Database Selection
Decision Matrix:
Neo4j (Recommended for most):
Pros: Mature, Cypher query language, graph algorithms, excellent visualization
Cons: Licensing costs for enterprise, scaling complexity
Use when: Complex queries, graph algorithms, team can learn Cypher
Amazon Neptune:
Pros: Managed service, supports Gremlin and SPARQL, AWS integration
Cons: Vendor lock-in, more expensive than self-hosted
Use when: AWS infrastructure, need managed service, compliance requirements
ArangoDB:
Pros: Multi-model (graph + document + key-value), AQL query language
Cons: Smaller community, fewer graph-specific features
Use when: Need document DB + graph in one system
TigerGraph:
Pros: Built for fast deep traversals, parallel processing
Cons: Complex setup, steeper learning curve
Use when: Massive graphs (billions of edges), real-time analytics
Technology Stack:
graph_database: 'Neo4j Community'  # or Enterprise for production
vector_integration: 'Pinecone'  # For hybrid search
embeddings: 'text-embedding-3-large'  # OpenAI
etl: 'Apache Airflow'  # For data pipelines
Neo4j Schema Setup:
// Create constraints for uniqueness
CREATE CONSTRAINT person_id IF NOT EXISTS
FOR (p:Person) REQUIRE p.id IS UNIQUE;

CREATE CONSTRAINT org_name IF NOT EXISTS
FOR (o:Organization) REQUIRE o.name IS UNIQUE;

// Create indexes for performance
CREATE INDEX entity_search IF NOT EXISTS
FOR (e:Entity) ON (e.name, e.type);

CREATE INDEX relationship_type IF NOT EXISTS
FOR ()-[r:RELATED_TO]-() ON (r.type, r.confidence);
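Because every schema statement above uses IF NOT EXISTS, the setup can be re-run safely on each deployment. The following is a minimal sketch of doing that from Python. The names `SessionLike`, `apply_schema`, and `RecordingSession` are hypothetical helpers introduced here for illustration; the only thing assumed about the database session is that it exposes a `run(query)` method, as a session from the official `neo4j` Python driver does.

```python
from typing import Iterable, List, Protocol


class SessionLike(Protocol):
    """Anything with a run(query) method, e.g. a neo4j driver session."""

    def run(self, query: str, **parameters: object) -> object: ...


# The four statements from the schema setup, one string per statement.
SCHEMA_STATEMENTS: List[str] = [
    "CREATE CONSTRAINT person_id IF NOT EXISTS "
    "FOR (p:Person) REQUIRE p.id IS UNIQUE",
    "CREATE CONSTRAINT org_name IF NOT EXISTS "
    "FOR (o:Organization) REQUIRE o.name IS UNIQUE",
    "CREATE INDEX entity_search IF NOT EXISTS "
    "FOR (e:Entity) ON (e.name, e.type)",
    "CREATE INDEX relationship_type IF NOT EXISTS "
    "FOR ()-[r:RELATED_TO]-() ON (r.type, r.confidence)",
]


def apply_schema(session: SessionLike,
                 statements: Iterable[str] = SCHEMA_STATEMENTS) -> int:
    """Run each schema statement once and return how many were sent."""
    count = 0
    for statement in statements:
        session.run(statement)
        count += 1
    return count


class RecordingSession:
    """Hypothetical stand-in for a real session; records queries for dry runs."""

    def __init__(self) -> None:
        self.executed: List[str] = []

    def run(self, query: str, **parameters: object) -> None:
        self.executed.append(query)
```

Against a live database, the same function could be handed a session obtained from `GraphDatabase.driver(uri, auth=(user, password)).session()`; the `RecordingSession` is only there so the statements can be checked without a running server.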
Phase 3: Entity Extraction & Relationship Building
Goal: Extract entities and relationships from data sources
Data Sources:
Structured: Databases, APIs, CSV files
Unstructured: Documents, web content, text files
Semi-structured: JSON, XML, knowledge bases
Entity Extraction Pipeline:
from typing import List

class EntityExtractionPipeline:
    def __init__(self):
        self.ner_model = load_ner_model()  # spaCy, Hugging Face
        self.entity_linker = EntityLinker()
        self.deduplicator = EntityDeduplicator()

    def process_text(self, text: str) -> List[Entity]:
        # 1. Extract named entities
        entities = self.ner_model.extract(text)

        # 2. Link to existing entities (entity resolution)
        linked_entities = self.entity_linker.link(entities)

        # 3. Deduplicate and resolve conflicts
        resolved_entities = self.deduplicator.resolve(linked_entities)
        return resolved_entities

Relationship Extraction:
class RelationshipExtractor:
    def extract_relationships(self, entities: List[Entity], text: str) -> List[Relationship]:
        relationships = []

        # Use dependency parsing or LLM for extraction
        # (self.nlp is assumed to be a sentence-splitting NLP pipeline set up elsewhere)
        doc = self.nlp(text)
        for sent in doc.sents:
            rels = self.extract_from_sentence(sent, entities)
            relationships.extend(rels)

        # Validate against ontology
        valid_relationships = self.validate_relationships(relationships)
        return valid_relationships
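The extractor above ends by validating relationships against the ontology, but the source does not show that step. One possible sketch: treat the ontology as a table of allowed (source type, target type) pairs per relationship type, in the spirit of the domain and range declared for worksFor in Phase 1. The `Entity` and `Relationship` dataclasses, the `ONTOLOGY` table, and the `min_confidence` threshold are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Entity:
    id: str
    name: str
    type: str


@dataclass(frozen=True)
class Relationship:
    source: Entity
    type: str
    target: Entity
    confidence: float


# Illustrative ontology: relationship type -> (source type, target type).
ONTOLOGY: Dict[str, Tuple[str, str]] = {
    "WORKS_FOR": ("Person", "Organization"),
    "LOCATED_IN": ("Organization", "Location"),
}


def validate_relationships(
    relationships: Iterable[Relationship],
    ontology: Dict[str, Tuple[str, str]] = ONTOLOGY,
    min_confidence: float = 0.5,
) -> List[Relationship]:
    """Keep relationships whose type, endpoints, and confidence fit the ontology."""
    valid: List[Relationship] = []
    for rel in relationships:
        allowed = ontology.get(rel.type)
        if allowed is None:
            continue  # relationship type is not part of the ontology
        if (rel.source.type, rel.target.type) != allowed:
            continue  # endpoint types violate the declared domain/range
        if rel.confidence < min_confidence:
            continue  # too uncertain to store
        valid.append(rel)
    return valid
```

Rejected relationships are silently dropped here; a real pipeline would more likely log them for review, since systematic rejections often point at a gap in the ontology rather than at bad extractions.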
LLM-Based Extraction (for complex relationships):
def extract_with_llm(text: str) -> List[Relationship]:
    prompt = f"""
    Extract entities and relationships from this text:
    {text}

    Format: (Entity1, Relationship, Entity2, Confidence)
    Only extract factual relationships.
    """

    response = llm.generate(prompt)
    relationships = parse_llm_response(response)
    return relationships
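The `parse_llm_response` helper called above is not defined in the source. A minimal sketch of what it might do, assuming the model follows the requested `(Entity1, Relationship, Entity2, Confidence)` format with one tuple per match: anything malformed is skipped rather than guessed at. The `ExtractedTriple` type is an assumption for this example; the original returns `Relationship` objects, which would require an extra lookup step to map names to entities.

```python
import re
from typing import List, NamedTuple


class ExtractedTriple(NamedTuple):
    subject: str
    predicate: str
    obj: str
    confidence: float


# Matches the text between one pair of parentheses (no nesting).
_TUPLE_PATTERN = re.compile(r"\(([^()]*)\)")


def parse_llm_response(response: str) -> List[ExtractedTriple]:
    """Parse '(Entity1, Relationship, Entity2, Confidence)' tuples, skipping bad ones."""
    triples: List[ExtractedTriple] = []
    for match in _TUPLE_PATTERN.finditer(response):
        parts = [part.strip() for part in match.group(1).split(",")]
        if len(parts) != 4:
            continue  # wrong number of fields
        subject, predicate, obj, raw_confidence = parts
        try:
            confidence = float(raw_confidence)
        except ValueError:
            continue  # confidence is not numeric
        if not 0.0 <= confidence <= 1.0:
            continue  # confidence outside the expected range
        if not (subject and predicate and obj):
            continue  # empty field
        triples.append(ExtractedTriple(subject, predicate, obj, confidence))
    return triples
```

This simple comma split breaks on entity names that themselves contain commas or parentheses. Asking the model for JSON output and parsing it with `json.loads` would be a sturdier choice if such names are common in the domain.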
Validation:
Entity extraction accuracy >85%
Entity deduplication working
Relationships validated against ontology
Confidence scores assigned
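The 85% target above does not say which metric is meant. One common reading, assumed here, is to compare extracted (name, type) pairs against a hand-labeled gold set and report precision, recall, and F1. The function name `extraction_scores` and the pair representation are choices made for this sketch.

```python
from typing import Dict, Set, Tuple

EntityLabel = Tuple[str, str]  # (entity name, entity type)


def extraction_scores(predicted: Set[EntityLabel],
                      gold: Set[EntityLabel]) -> Dict[str, float]:
    """Precision, recall, and F1 of predicted entities against a gold set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


predicted = {("Alice", "Person"), ("Acme", "Organization"), ("Berlin", "Person")}
gold = {("Alice", "Person"), ("Acme", "Organization"),
        ("Berlin", "Location"), ("Bob", "Person")}
scores = extraction_scores(predicted, gold)
```

In this toy sample, "Berlin" is found but mistyped and "Bob" is missed entirely, so both precision and recall fall below the threshold. Tracking the two numbers separately shows whether the pipeline is over-extracting or under-extracting.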
Phase 4: Hybrid Knowledge-Vector Architecture
Goal: Combine structured graph with semantic vector search
Architecture:
class HybridKnowledgeSystem:
    def __init__(self):
        self.graph_db = Neo4jConnection()
        self.vector_db = PineconeClient()
        self.embedding_model = OpenAIEmbeddings()

    def store_entity(self, entity: Entity):
        # Store structured data in graph
        self.graph_db.create_node(entity)

        # Store embeddings in vector database
        embedding = self.embedding_model.embed(entity.description)
        self.vector_db.upsert(
            id=entity.id,
            values=embedding,
            metadata=entity.metadata
        )

    def hybrid_search(self, query: str, top_k: int = 10) -> SearchResults:
        # 1. Vector similarity search
        query_embedding = self.embedding_model.embed(query)
        vector_results = self.vector_db.query(
            vector=query_embedding,
            top_k=100
        )

        # 2. Graph traversal from vector results
        entity_ids = [r.id for r in vector_results.matches]
        graph_results = self.graph_db.get_subgraph(entity_ids, max_hops=2)

        # 3. Merge and rank results
        merged = self.merge_results(vector_results, graph_results)
        return merged[:top_k]
Benefits of Hybrid Approach:
Vector search: Semantic similarity, flexible queries
Graph traversal: Relationship-based reasoning, context expansion
Combined: Vector search finds semantically relevant entry points, and graph traversal expands them with connected context
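The merge-and-rank step of the hybrid search is left undefined in the source. One way it could work, sketched under stated assumptions: each entity gets a weighted blend of its vector similarity and a graph proximity score that decays with hop distance from the nearest vector hit. The inputs (plain dicts), the `graph_weight` default of 0.3, and the `1 / (1 + hops)` decay are illustrative choices, not the author's method.

```python
from typing import Dict, List, Tuple


def merge_results(
    vector_scores: Dict[str, float],
    graph_hops: Dict[str, int],
    graph_weight: float = 0.3,
) -> List[Tuple[str, float]]:
    """Blend similarity scores with hop-distance proximity; rank best first.

    vector_scores: entity id -> similarity in [0, 1] from the vector search.
    graph_hops: entity id -> hops from the nearest vector hit (0 for a hit itself).
    """
    if not 0.0 <= graph_weight <= 1.0:
        raise ValueError("graph_weight must be between 0 and 1")
    combined: Dict[str, float] = {}
    for entity_id in set(vector_scores) | set(graph_hops):
        similarity = vector_scores.get(entity_id, 0.0)
        hops = graph_hops.get(entity_id)
        proximity = 1.0 / (1 + hops) if hops is not None else 0.0
        combined[entity_id] = (
            (1 - graph_weight) * similarity + graph_weight * proximity
        )
    # Sort by score descending, then by id so ties are deterministic.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


ranked = merge_results(
    vector_scores={"a": 0.9, "b": 0.5},
    graph_hops={"a": 0, "b": 0, "c": 1},
)
```

Here "c" was never returned by the vector search but sits one hop from a hit, so it still enters the ranking with a small score. That is the context-expansion effect the hybrid approach is after.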
Phase 5: Query Patterns & API Design
Common Query Patterns:
1. Find Entity:
MATCH (e:Entity {id: $entity_id})
RETURN e
2. Find Relationships:
MATCH (source:Entity {id: $entity_id})-[r]-(target)
RETURN source, r, target
LIMIT 20
3. Path Between Entities:
MATCH path = shortestPath(
  (source:Person {id: $source_id})-[*..5]-(target:Person {id: $target_id})
)
RETURN path
4. Multi-Hop Traversal:
MATCH (p:Person {name: $name})-[:WORKS_FOR]->(o:Organization)-[:LOCATED_IN]->(l:Location)
RETURN p.name, o.name, l.city
5. Recommendation Query:
// Find people similar to this person based on shared organizations
MATCH (p1:Person {id: $person_id})-[:WORKS_FOR]->(o:Organization)<-[:WORKS_FOR]-(p2:Person)
WHERE p1 <> p2
RETURN p2, COUNT(o) AS shared_orgs
ORDER BY shared_orgs DESC
LIMIT 10
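To make the logic of the recommendation query concrete, here is the same computation over an in-memory mapping instead of a graph database. The `works_for` dict and the function name `recommend_colleagues` are assumptions for this sketch. The alphabetical tie-break is also an addition; the Cypher query leaves the order of ties unspecified.

```python
from typing import Dict, List, Set, Tuple


def recommend_colleagues(
    works_for: Dict[str, Set[str]],
    person_id: str,
    limit: int = 10,
) -> List[Tuple[str, int]]:
    """Rank other people by how many organizations they share with person_id."""
    own_orgs = works_for.get(person_id, set())
    shared: Dict[str, int] = {}
    for other_id, orgs in works_for.items():
        if other_id == person_id:
            continue  # mirrors WHERE p1 <> p2
        overlap = len(own_orgs & orgs)  # mirrors COUNT(o)
        if overlap:
            shared[other_id] = overlap
    # ORDER BY shared_orgs DESC, then by id for a stable order; LIMIT.
    ranked = sorted(shared.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


works_for = {
    "alice": {"acme", "globex"},
    "bob": {"acme", "globex"},
    "carol": {"acme"},
    "dave": {"initech"},
}
recommendations = recommend_colleagues(works_for, "alice")
```

The Python version scans every person, which is the cost the graph database avoids: the Cypher pattern only visits people reachable through the starting person's organizations.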
Knowledge Graph API:
from typing import List, Optional

class KnowledgeGraphAPI:
    def __init__(self, graph_db):
        self.graph = graph_db

    def find_entity(self, entity_name: str) -> Entity:
        """Find entity by name with fuzzy matching"""
        # Requires the APOC plugin for the edit-distance function
        query = """
        MATCH (e:Entity)
        WHERE e.name CONTAINS $name
        RETURN e
        ORDER BY apoc.text.levenshteinDistance(e.name, $name)
        LIMIT 1
        """
        return self.graph.run(query, name=entity_name).single()

    def find_relationships(
        self,
        entity_id: str,
        relationship_type: Optional[str] = None,
        max_hops: int = 2
    ) -> List[Relationship]:
        """Find relationships within specified hops"""
        # Path-length bounds cannot be passed as query parameters, so
        # max_hops is interpolated; int() keeps non-numeric input out.
        query = f"""
        MATCH (source:Entity {{id: $entity_id}})
        MATCH path = (source)-[r*1..{int(max_hops)}]-(target)
        WHERE $relationship_type IS NULL
           OR all(rel IN relationships(path) WHERE type(rel) = $relationship_type)
        RETURN path, relationships(path) AS rels
        LIMIT 100
        """
        return self.graph.run(
            query,
            entity_id=entity_id,
            relationship_type=relationship_type
        ).data()

    def get_subgraph(self, entity_ids: List[str], max_hops: int = 2) -> Subgraph:
        """Get connected subgraph for multiple entities"""
        query = f"""
        MATCH (e:Entity)
        WHERE e.id IN $entity_ids
        CALL apoc.path.subgraphAll(e, {{maxLevel: {int(max_hops)}}})
        YIELD nodes, relationships
        RETURN nodes, relationships
        """
        return self.graph.run(query, entity_ids=entity_ids).data()
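The API above builds the hop limit into the query text with an f-string, since variable-length path bounds are written as literals in Cypher rather than passed as parameters. A sketch of guarding that interpolation follows; the function name `build_neighborhood_query` and the `hop_limit` cap of 5 are assumptions chosen for this example.

```python
def build_neighborhood_query(max_hops: int, hop_limit: int = 5) -> str:
    """Return a Cypher query with a validated, literal hop bound.

    Only the hop count is interpolated; the entity id stays a parameter.
    """
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(max_hops, bool) or not isinstance(max_hops, int):
        raise TypeError("max_hops must be an int")
    if not 1 <= max_hops <= hop_limit:
        raise ValueError(f"max_hops must be between 1 and {hop_limit}")
    return (
        "MATCH (source:Entity {id: $entity_id})\n"
        f"MATCH path = (source)-[*1..{max_hops}]-(target)\n"
        "RETURN path\n"
        "LIMIT 100"
    )


query = build_neighborhood_query(2)
```

Capping the bound matters for more than injection safety. Each extra hop can multiply the number of matched paths, so an unbounded value coming from an API caller is also a performance risk.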
Phase 6: AI Integration & Hallucination Prevention
Goal: Use knowledge graph to ground LLM responses and detect hallucinations
Knowledge Graph RAG:
class KnowledgeGraphRAG:
    def __init__(self, kg_api, llm_client):
        self.kg = kg_api
        self.llm = llm_client

    def retrieve_context(self, query: str) -> str:
        # Extract entities from query
        entities = self.extract_entities_from_query(query)

        # Retrieve relevant subgraph
        subgraph = self.kg.get_subgraph(
            [e.id for e in entities],
            max_hops=2
        )

        # Format subgraph for LLM
        context = self.format_subgraph_for_llm(subgraph)
        return context

    def generate_with_grounding(self, query: str) -> GroundedResponse:
        context = self.retrieve_context(query)

        prompt = f"""
        Context from knowledge graph:
        {context}

        User query: {query}

        Answer based only on the provided context. Include source entities.
        """

        response = self.llm.generate(prompt)

        return GroundedResponse(
            response=response,
            sources=self.extract_sources(context),
            confidence=self.calculate_confidence(response, context)
        )

Hallucination Detection:
class HallucinationDetector:
    def __init__(self, knowledge_graph):
        self.kg = knowledge_graph

    def verify_claim(self, claim: str) -> VerificationResult:
        # Parse claim into (subject, predicate, object)
        parsed_claim = self.parse_claim(claim)

        # Query knowledge graph for evidence
        evidence = self.kg.find_evidence(
            parsed_claim.subject,
            parsed_claim.predicate,
            parsed_claim.object
        )

        if evidence:
            return VerificationResult(
                is_supported=True,
                evidence=evidence,
                confidence=evidence.confidence
            )

        # Check for contradictory evidence
        contradiction = self.kg.find_contradiction(parsed_claim)

        return VerificationResult(
            is_supported=False,
            is_contradicted=bool(contradiction),
            contradiction=contradiction
        )

Key Principles
1. Start with Ontology
Define your schema before ingesting data. Changing the ontology later is expensive.
2. Entity Resolution is Critical
Deduplicate entities aggressively. "Apple Inc", "Apple", "Apple Computer" → same entity.
3. Confidence Scores on Everything
Every relationship should have a confidence score (0.0-1.0) and source.
4. Incremental Building
Don't try to model the entire domain at once. Start with core entities and expand.
5. Hybrid Architecture Wins
Combine graph traversal (structured) with vector search (semantic) for best results.
Common Use Cases
1. Question Answering:
Extract entities from question
Traverse graph to find answer
Return path as explanation
2. Recommendation:
Find similar entities via shared relationships
Rank by relationship strength
Return top-K recommendations
3. Fraud Detection:
Model transactions as graph
Find suspicious patterns (cycles, anomalies)
Flag for review
4. Knowledge Discovery:
Identify implicit relationships
Suggest missing connections
Validate with domain experts
5. Semantic Search:
Hybrid vector + graph search
Expand context via relationships
Return rich connected results
Technology Recommendations
For MVPs (<10K entities):
Neo4j Community Edition (free)
SQLite for metadata
OpenAI embeddings
FastAPI for API layer
For Production (10K-1M entities):
Neo4j Enterprise or ArangoDB
Pinecone for vector search
Airflow for ETL
GraphQL API
For Enterprise (1M+ entities):
Neo4j Enterprise or TigerGraph
Distributed vector DB (Pinecone, Weaviate)
Kafka for streaming
Kubernetes deployment
Validation Checklist
Ontology designed and validated with domain experts
Graph database selected and set up
Entity extraction pipeline tested (>85% accuracy)
Relationship extraction validated
Hybrid search (graph + vector) implemented
Query API created and documented
AI integration tested (RAG or hallucination detection)
Performance benchmarks met (query <100ms for common patterns)
Data quality monitoring in place
Backup and recovery tested
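For the "AI integration tested" item, the claim-verification flow from Phase 6 can be exercised without a database. The sketch below stores facts as in-memory triples and distinguishes three outcomes: supported, contradicted, and unknown. The `TripleStore` class and the notion of functional predicates (predicates that allow only one object per subject, such as a birthplace) are assumptions introduced here. The source does not say how `find_contradiction` decides that two facts conflict.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


@dataclass
class VerificationResult:
    is_supported: bool
    is_contradicted: bool = False
    confidence: float = 0.0
    contradiction: Optional[Triple] = None


class TripleStore:
    """Hypothetical in-memory stand-in for the knowledge graph."""

    def __init__(self, facts: Dict[Triple, float],
                 functional_predicates: Set[str]) -> None:
        self.facts = facts  # triple -> confidence score
        self.functional_predicates = functional_predicates

    def verify(self, claim: Triple) -> VerificationResult:
        # 1. Direct evidence: the exact triple is stored.
        if claim in self.facts:
            return VerificationResult(True, confidence=self.facts[claim])
        # 2. Contradiction: a single-valued predicate already has another object.
        subject, predicate, _ = claim
        if predicate in self.functional_predicates:
            for known in self.facts:
                if known[0] == subject and known[1] == predicate:
                    return VerificationResult(
                        False, is_contradicted=True, contradiction=known
                    )
        # 3. Unknown: neither supported nor contradicted.
        return VerificationResult(False)


store = TripleStore(
    facts={
        ("Alice", "BORN_IN", "Paris"): 0.95,
        ("Alice", "WORKS_FOR", "Acme"): 0.8,
    },
    functional_predicates={"BORN_IN"},
)
supported = store.verify(("Alice", "BORN_IN", "Paris"))
contradicted = store.verify(("Alice", "BORN_IN", "Rome"))
unknown = store.verify(("Alice", "WORKS_FOR", "Globex"))
```

Keeping "unknown" separate from "contradicted" matters in practice. A person can work for several organizations, so a missing WORKS_FOR fact is only absence of evidence, while a second birthplace conflicts with what the graph already holds.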
